We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting-edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images, assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework trained on segmented face images, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models onto the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
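The core assumption above, that non-face regions can be fully masked out before tracking, can be illustrated with a minimal sketch. The function name `mask_non_face` and the use of a boolean mask array are our own illustrative choices, not from the paper; in the actual system the mask would come from the segmentation network rather than being constructed by hand.

```python
import numpy as np

def mask_non_face(rgb, face_mask, fill=0):
    """Suppress non-face pixels so a downstream tracker sees only the face.

    rgb:       H x W x 3 uint8 image
    face_mask: H x W boolean array (True = face pixel), e.g. the output
               of a pixel-level facial segmentation network
    fill:      value written into masked-out (non-face) pixels
    """
    out = rgb.copy()
    out[~face_mask] = fill  # zero out occluders: hair, hands, accessories
    return out

# Toy usage: a 4x4 gray image whose left half is labeled "face".
img = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
masked = mask_non_face(img, mask)
# Left columns keep their values; right columns are filled with 0.
```

Feeding such masked images to the regression-based tracker, both at training and test time, is what lets the tracker ignore the high-dimensional appearance variability of occluders.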